Morphological Analysis of the Slovak National Corpus
Abstract
1. Basis of a morphological analysis of the Slovak National Corpus

Morphological (or morphosyntactic) analysis has been a key problem in natural language processing (NLP) for several years. Automatic morphological annotation is a useful tool, especially for processing corpus data, and it has therefore also been considered during the development of the Slovak National Corpus (SNC). The theoretical aspects of morphological analysis and its application in corpus tagging, together with the preparation of the morphological tagset for the manual tagging of the SNC, were outlined by M. Forróová and A. Horák (2003, in press). Annotation, generally understood as the process of adding information to texts, is undoubtedly a convenient tool, in spite of differing views, both for verifying linguistic theories and for carrying out various lexicographic projects.

Morphological analysis is generally understood as the assignment of a base form (lemmatization), the classification of words into grammatical and semantic classes, and the assignment of grammatical categories to words in texts (in the form of tags). For a language-competent person this kind of analysis is not difficult; for computer processing, however, it is a hard nut to crack (Forróová – Horák, 2003). At the same time, we take into account the problem of formal homonymy that results from automatic morphological annotation; this problem requires subsequent disambiguation.

From the very beginning we were aware that a set of morphological tags should represent the linguistic properties of a text; in other words, it should in some way interpret the text. We had to decide whether the proposed tagset would result in a new formal description of the language, or whether we would draw on existing linguistic descriptions and try to formalise them. Forróová and Horák (2003) point to seven maxims proposed by G. Leech, which provide for the regularity of annotation and guarantee that annotation does not result in misinterpretations of corpus data. We would like to emphasize the maxim of accessibility and the maxim of consensus among academic theories (theoretical “neutrality”), which was predominantly taken into consideration when preparing the tagset.

The central issue can be formulated as follows: to what extent can we refer to traditional grammatical descriptions of Slovak morphology when preparing the lemmatization and the tagset? We considered it relevant to take into account the systemic description given in the academic Morfológia slovenského jazyka (lit. Morphology of the Slovak Language, MSJ, 1966), and possibly some other works dealing with morphology (Oravec – Bajzíková – Furdík, 1984; Dvonč, 1984). The conflict between representatives of traditional grammatical categories and the possibilities of automatic language processing is also reflected in the approach to the morphological tagset for the SNC. M.
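The notions of lemmatization, tag assignment and formal homonymy described in the abstract can be illustrated with a minimal sketch. It assumes a toy hand-written lexicon; the names `Analysis`, `LEXICON`, `analyse` and `is_ambiguous`, as well as the tag strings, are hypothetical and are not the SNC tagset or any SNC tool. The Slovak form "ženy" serves as the example because it can be the genitive singular, nominative plural or accusative plural of "žena" (woman).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Analysis:
    """One morphological reading of a word form: base form plus a tag."""
    lemma: str
    tag: str  # hypothetical tag format: class|gender|number|case


# Toy lexicon (an assumption for illustration, not real SNC data).
LEXICON: Dict[str, List[Analysis]] = {
    "ženy": [
        Analysis("žena", "noun|fem|sg|gen"),
        Analysis("žena", "noun|fem|pl|nom"),
        Analysis("žena", "noun|fem|pl|acc"),
    ],
    "žena": [Analysis("žena", "noun|fem|sg|nom")],
}


def analyse(word_form: str) -> List[Analysis]:
    """Return every reading listed for the form; empty list if unknown."""
    return LEXICON.get(word_form.lower(), [])


def is_ambiguous(word_form: str) -> bool:
    """A form is formally homonymous when it has more than one reading."""
    return len(analyse(word_form)) > 1


if __name__ == "__main__":
    for form in ("žena", "ženy"):
        print(form, is_ambiguous(form), [a.tag for a in analyse(form)])
```

In this sketch the analyser only lists candidate readings; choosing among the three readings of "ženy" in a given sentence is the separate disambiguation step that the abstract mentions.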
Similar resources
5th Workshop on Intelligent and Knowledge oriented Technologies
The article presents the current state of affairs in several projects conducted by the Slovak National Corpus department of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. We describe the Slovak National Corpus, the Corpus of Spoken Slovak, tools used for linguistic analysis and an ongoing effort to create a Slovak WordNet. 1 Slovak National Corpus The Slovak National Corpus is a huge...
Slovak Morphosyntactic Tagset
Morphological annotation constitutes essential, very useful and very common linguistic information presented in corpora, especially for highly inflectional languages. The morphological tagset used in the Slovak National Corpus has been designed with several goals in mind – the tags are compact and easily human-readable, without sacrificing their informational contents. The tags consist of ASCII...
Slovak National Corpus tools and resources
The article presents the current state of affairs in several projects conducted by the Slovak National Corpus department of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. We describe the Slovak National Corpus, the Corpus of Spoken Slovak, tools used for linguistic analysis and an ongoing effort to create a Slovak WordNet. 1 Slovak National Corpus The Slovak National Corpus is a huge...
Comparison of two different techniques of warfarin dosing determination - A chemometrics study
A high prevalence of genetic polymorphisms increases sensitivity to warfarin therapy. In this study, we investigated 47 patients with effective long-term therapy by warfarin well-controlled by monitoring of International Normalised Ratio (INR). All patients were tested for gene polymorphisms VKORC1, CYP2C9*C2, and CYP2C9*C3, which were used for a dose calculation employing a program www.Warfari...
Publication date: 2006